[SPARK-9673][SQL] Sample standard deviation aggregation function#8058
[SPARK-9673][SQL] Sample standard deviation aggregation function#8058brkyvz wants to merge 18 commits intoapache:masterfrom
Conversation
|
Test build #40261 has finished for PR 8058 at commit
|
There was a problem hiding this comment.
I think it is better to call it stddev_samp because other databases have both stddev_samp and stddev_pop (population standard deviation).
|
I will take a look at the failed test case when SortBasedAggregation is used. |
|
My diagnosis is that count is updated for the first updateExpression. Then for calculating average, the updatedCount expression is used there with count + 1, therefore the average and moment calculations get messed up. |
|
I think the main problem is the way that |
There was a problem hiding this comment.
At here, deltaX means currentValue - previousAvg, right? If so, because we have already updated currentAvg, deltaX means currentValue - updatedAvg.
There was a problem hiding this comment.
For now, maybe we can add a deltaX field in the buffer to let you store the value of currentValue - previousAvg to workaround the problem.
There was a problem hiding this comment.
I will take a look at mutable projection and try to move field update part after evaluating expressions.
There was a problem hiding this comment.
That unfortunately doesn't totally solve the problem for SortBasedAggregation, and it corrupts the result for TungstenAggregation :( Because TungstenAggregation was happy with the way things were.
|
Test build #40309 has finished for PR 8058 at commit
|
|
Test build #40310 has finished for PR 8058 at commit
|
There was a problem hiding this comment.
Instead of changing how we compare double values, how about we change our tests by casting results to the decimal type?
|
Test build #40329 has finished for PR 8058 at commit
|
…ementation based on AggregateFunction2 if possible.
First resolve stddev functions to Hive's GenericUDAF and then replace them to our native functions.
|
Test build #40391 has finished for PR 8058 at commit
|
|
retest this please |
|
Test build #40409 has finished for PR 8058 at commit
|
|
Test build #40416 has finished for PR 8058 at commit
|
|
retest this please |
1 similar comment
|
retest this please |
|
Test build #1450 has finished for PR 8058 at commit
|
|
Test build #1451 has finished for PR 8058 at commit
|
|
Do we want to support population variance? I don't think it is necessary to make two methods. R supports only sample variance, which is sufficient. It would be simpler if we implement sample variance first and then wrap |
|
Since most database systems do, I think we have to support it as well, since it's pretty simple to go from one or the other on our side |
Make describe work in SQLContext.
|
Test build #40486 has finished for PR 8058 at commit
|
|
Test build #40490 has finished for PR 8058 at commit
|
This PR adds the sample standard deviation as a udf, and a grouped aggregate function for SQL. It now works for TungstenAggregation with codegen, but not for SortBasedAggregation. I have some
printlns in the code for debugging purposes and need help from @yhuai @marmbrus and @rxin for this to work for both methods...cc @mengxr